
WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

Neural Information Processing Systems

While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis, as well as the feedback we collected from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code, and the interactive game, allowing future data collection that can be used to develop models with better association abilities.
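The >90% Jaccard index cited above measures set overlap between players' selections. A minimal sketch of how such agreement could be scored (the cue and candidate names below are illustrative, not from the dataset):

```python
def jaccard(selected: set, reference: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|; 1.0 means perfect agreement."""
    if not selected and not reference:
        return 1.0
    return len(selected & reference) / len(selected | reference)

# Hypothetical cue "full moon" over five visual candidates:
reference = {"werewolf", "tide", "owl"}   # images the spymaster associated
guess = {"werewolf", "tide", "bat"}       # images another player picked
print(round(jaccard(guess, reference), 2))  # → 0.5
```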


Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge

Neural Information Processing Systems

Evidence suggests that large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control. Recently, it has been shown that Transformer-based models succeed in consistent reasoning over explicit symbolic facts, under a closed-world assumption. However, in an open-domain setup, it is desirable to tap into the vast reservoir of implicit knowledge already encoded in the parameters of pre-trained LMs. In this work, we provide a first demonstration that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements. To do this, we describe a procedure for automatically generating datasets that teach a model new reasoning skills, and demonstrate that models learn to effectively perform inference which involves implicit taxonomic and world knowledge, chaining and counting. Finally, we show that teaching models to reason generalizes beyond the training distribution: they successfully compose the usage of multiple reasoning skills in single examples. Our work paves a path towards open-domain systems that constantly improve by interacting with users who can instantly correct a model by adding simple natural language statements.
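To make the idea of automatically generated reasoning data concrete, here is a toy sketch of producing taxonomic-inference examples from an is-a hierarchy. The taxonomy and statement template are illustrative assumptions, not the authors' actual pipeline:

```python
# Toy is-a hierarchy; each entity maps to its direct hypernym.
TAXONOMY = {"whale": "mammal", "mammal": "animal", "sparrow": "bird", "bird": "animal"}

def ancestors(entity):
    """Follow is-a edges upward, yielding every hypernym of `entity`."""
    while entity in TAXONOMY:
        entity = TAXONOMY[entity]
        yield entity

def generate_examples():
    """Emit (statement, label) pairs; negatives pair an entity with a non-ancestor."""
    examples = []
    entities = list(TAXONOMY)
    for e in entities:
        anc = set(ancestors(e))
        for a in sorted(anc):
            examples.append((f"{e} is a kind of {a}.", True))
        for other in entities:
            if other != e and other not in anc:
                examples.append((f"{e} is a kind of {other}.", False))
    return examples
```

Chaining examples (e.g., whale → mammal → animal) fall out of the transitive walk in `ancestors`; a model trained on such pairs must combine them with its implicit pre-trained knowledge to answer unseen cases.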


Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs

Horoi, Stefan, Cho, Sangwoo, Chakraborty, Supriyo, Zhang, Shi-Xiong, Sahu, Sambit, Wolf, Guy, Winata, Genta Indra

arXiv.org Artificial Intelligence

Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs), but it often suffers from negative interference when models have diverged during training. We address this limitation by first aligning the models' parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We adapt parameter space alignment for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based approaches. Using this alignment-first strategy, we successfully transfer advanced reasoning skills to a non-reasoning model. Experiments on challenging reasoning benchmarks show that our method consistently outperforms standard task arithmetic. This work provides an effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.
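The alignment-first idea can be illustrated on a single weight matrix with a permutation symmetry. Everything below is a minimal sketch: the permutation is assumed to be already recovered by some alignment procedure, and real Transformers additionally need per-layer rotation and scaling handling:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 3))      # target model's layer weights
perm = np.array([2, 0, 3, 1])       # hypothetical recovered neuron permutation
donor = base[perm] + 0.1            # donor model: permuted base plus a "skill" delta

def align(w, p):
    """Undo the donor's neuron permutation so it matches the base's ordering."""
    inv = np.argsort(p)
    return w[inv]

# Naive task arithmetic vs. alignment-first:
naive = base + (donor - base)               # rows misaligned -> interference
aligned = base + (align(donor, perm) - base)  # task vector is a clean +0.1 shift
print(np.allclose(aligned, base + 0.1))     # → True
```

After alignment, the task vector `align(donor, perm) - base` isolates the skill delta; without it, subtracting row-permuted weights mixes unrelated neurons.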


Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

Lee, Daeun, Yoon, Jaehong, Cho, Jaemin, Bansal, Mohit

arXiv.org Artificial Intelligence

Recent advances in Chain-of-Thought (CoT) reasoning have improved complex video understanding, but existing methods often struggle to adapt to domain-specific skills (e.g., event detection, spatial relation understanding, emotion understanding) across diverse video content. To address this, we propose Video-Skill-CoT (a.k.a. Video-SKoT), a framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning. First, we construct skill-based CoT annotations: we extract domain-relevant reasoning skills from training questions, cluster them into a shared skill taxonomy, and create detailed multi-step CoT rationales tailored to each video-question pair for training. Second, we introduce a skill-specific expert learning framework. Each expert module specializes in a subset of reasoning skills and is trained with lightweight adapters using the collected CoT supervision. We demonstrate the effectiveness of the proposed approach on three video understanding benchmarks, where Video-SKoT consistently outperforms strong baselines. We also provide in-depth analyses comparing different CoT annotation pipelines and the skills learned across multiple video domains.
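A toy sketch of the skill-routing idea: once questions are clustered into a small skill taxonomy, each incoming question is dispatched to the matching expert. The skill names and keyword heuristic below are illustrative stand-ins, not the paper's learned clustering:

```python
# Hypothetical skill taxonomy with keyword signatures per expert.
SKILL_KEYWORDS = {
    "event_detection": ["happen", "event", "when"],
    "spatial_relation": ["left", "right", "behind", "where"],
    "emotion": ["feel", "emotion", "mood"],
}

def route(question: str) -> str:
    """Pick the expert whose keyword signature best matches the question."""
    q = question.lower()
    scores = {s: sum(k in q for k in kws) for s, kws in SKILL_KEYWORDS.items()}
    return max(scores, key=scores.get)

print(route("Where is the dog relative to the sofa?"))  # → spatial_relation
```

In the actual framework this dispatch would select a lightweight adapter trained on that skill's CoT rationales rather than a keyword match.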


PHYRE: A New Environment for Benchmarking Physical Reasoning (Author Response)

Neural Information Processing Systems

We thank the reviewers for their detailed and constructive comments. Overall, the reviewers were positive about this contribution: "The task is compelling and the benchmark is well thought out" [R1]; "I like this paper, as it presents …" [R2]; "The benchmark is designed to encourage physical …". The reviewers also raised concerns, which we address next. For example, in CLEVR it now seems likely that some models (e.g., Relation Networks) have found shortcut "cheats". It is difficult to characterize what constitutes "intrinsic" difficulty. As a whole, the community must "go for recall" … By releasing PHYRE to the public, we hope to see rapid exploration of these good suggestions. We will attempt to improve the clarity.


Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Nakamura, Taishi, Ishikawa, Satoki, Kawamura, Masaki, Okamoto, Takumi, Nohara, Daisuke, Suzuki, Jun, Yokota, Rio

arXiv.org Artificial Intelligence

Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. The recent evolution of large language models (LLMs) has been driven by empirical scaling laws (Hestness et al., 2017) that link training loss to model size, dataset size, and compute budget. Kaplan et al. (2020) showed that these laws hold across seven orders of magnitude, establishing them as a reliable extrapolation tool for dense Transformers. Subsequent work by Hoffmann et al. (2022) demonstrated that scaling curves can be inverted to choose the compute-optimal combination of parameters and tokens for a fixed budget. Together, these results have made scaling analysis a cornerstone of model planning at both academic and industrial labs.
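The two quantities the abstract centers on, active compute and tokens-per-parameter, admit a simple back-of-envelope calculation. The expert fraction and the model/token sizes below are illustrative assumptions, not numbers from the paper:

```python
def active_params(total_params, n_experts, top_k, expert_frac=0.6):
    """Rough active-parameter count: shared weights plus the top-k slice of expert weights.
    `expert_frac` (assumed) is the fraction of parameters living in expert FFNs."""
    shared = total_params * (1 - expert_frac)
    experts = total_params * expert_frac
    return shared + experts * (top_k / n_experts)

def tokens_per_param(train_tokens, total_params):
    """TPP: total training tokens divided by total parameters."""
    return train_tokens / total_params

total = 100e9  # hypothetical 100B-total-parameter MoE
print(f"active: {active_params(total, n_experts=64, top_k=8) / 1e9:.1f}B")  # → active: 47.5B
print(f"TPP: {tokens_per_param(2e12, total):.0f}")                          # → TPP: 20
```

Sweeping `top_k` at fixed `total_params` changes active FLOPs without changing TPP, which is exactly the axis the paper's compute-matched MoE families vary.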


Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning

Rajaee, Sara, Choenni, Rochelle, Shutova, Ekaterina, Monz, Christof

arXiv.org Artificial Intelligence

While the reasoning abilities of large language models (LLMs) continue to advance, it remains unclear how such ability varies across languages in multilingual LLMs and whether different languages produce reasoning paths that complement each other. To investigate this question, we train a reward model to rank generated responses for a given question across languages. Our results show that our cross-lingual reward model substantially improves mathematical reasoning performance compared to using reward modeling within a single language, benefiting even high-resource languages. While English often exhibits the highest performance in multilingual models, we find that cross-lingual sampling particularly benefits English under low sampling budgets. Our findings reveal new opportunities to improve multilingual reasoning by leveraging the complementary strengths of diverse languages.
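The cross-lingual selection step can be sketched as best-of-n over a pooled multilingual candidate set. The scoring function below is a toy stand-in for the trained reward model, and the candidate answers are invented for illustration:

```python
def best_of_l(candidates, reward):
    """candidates: list of (language, answer); return the pair with the highest reward."""
    return max(candidates, key=lambda c: reward(c[1]))

# Toy reward: prefer answers containing the correct result "42" (illustrative only;
# the paper uses a learned cross-lingual reward model, not string matching).
reward = lambda ans: 1.0 if "42" in ans else 0.0
pool = [
    ("en", "The answer is 41."),
    ("de", "Die Antwort ist 42."),
    ("fr", "La réponse est 40."),
]
print(best_of_l(pool, reward))  # → ('de', 'Die Antwort ist 42.')
```

Pooling across languages before ranking is what lets a correct reasoning path in one language rescue a question the model answers wrongly in another.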